Abstract

We investigate research and development collaborations under the EU
Framework Programs (FPs) for Research and Technological Development. The
collaborations in the FPs give rise to bipartite networks, with edges
existing between projects and the organizations taking part in them. A
version of the modularity measure, adapted to bipartite networks, is
presented. Communities are found so as to maximize the bipartite
modularity. Projects in the resulting communities are shown to be topically
differentiated.

1 Introduction

The EU Framework Programs (FPs) for Research and Technological Development
were implemented to follow two main strategic objectives: first,
strengthening the scientific and technological bases of European industry to
foster international competitiveness and, second, the promotion of research
activities in support of other EU policies. In spite of their different
scopes, the fundamental rationale of the FPs has remained unchanged. All FPs
share a few common structural key elements. First, only projects of limited
duration that mobilize private and public funds at the national level are
funded. Second, the focus of funding is on multinational and multi-actor
collaborations that add value by operating at the European level. Third,
project proposals are to be submitted by self-organized consortia and the
selection for funding is based on specific scientific excellence and
socio-economic relevance criteria (Roediger-Schluga and Barber 2006). By
considering the constituents of these consortia, we can represent and analyze
the FPs as networks of projects and organizations. The resulting networks are
of substantial size, including over 50 thousand projects and over 30 thousand
organizations.
We have a general interest in studying a real-world network of large size and
high complexity from a methodological point of view. Furthermore,
socio-economic research emphasizes the central importance of collaborative
activities in R&D for economic competitiveness (see, for instance, Fagerberg
et al. 2005, among many others). Mainly for reasons of data availability,
attempts to evaluate quantitatively the structure and function of the large
social networks generated in the EU FPs have begun only in the last few
years, using social network analysis and complex networks methodologies
(Almendral et al. 2007; Barber et al. 2006, 2008; Breschi and Cusmano 2004;
Roediger-Schluga and Barber 2008). Studies to date point to the presence of a
dense and hierarchical network. A highly connected core of frequent
participants, taking leading roles within consortia, is linked to a large
number of peripheral actors, forming a giant component that exhibits the
characteristics of a small world.

Networks have attracted a burst of attention in the last decade (for useful
reviews, see Albert and Barabasi 2002; Christensen and Albert 2007;
Dorogovtsev and Mendes 2004; Newman 2003), with applications to natural,
social, and technological networks. Of great current interest is the
identification of community structure within networks. Stated informally, a
community is a portion of the network whose members are more tightly linked
to one another than to other members of the network. A variety of approaches
(Angelini et al. 2007; Clauset et al. 2004; Girvan and Newman 2002;
Gol'dshtein and Koganov 2006; Hastings 2006; Newman and Girvan 2004; Newman
and Leicht 2007; Palla et al. 2005; Reichardt and Bornholdt 2006) have been
taken to explore this concept; Danon et al. (2005) and Newman (2004b) provide
useful reviews. Detecting the community structure allows quantitative
investigation of relevant heterogeneous substructures formed in the network.

We investigate networks of research and development collaborations under 
the FPs. The collaborations in the FPs give rise to bipartite networks, with 
edges existing between projects and the organizations taking part in them. 
With this construction, participating organizations are linked only through joint 
projects. The resulting networks are quite large compared to typical social net- 
works, containing tens of thousands of vertices and edges. At this scale, visu- 
alization of the networks is quite difficult, so we instead take an algorithmic 
approach to community identification. A version of the modularity measure 
(Newman and Girvan 2004), adapted to bipartite networks (Barber 2007), is 
used to assess the quality of a division of the vertices into communities. Com- 
munities are found by maximizing the bipartite modularity. We consider topical 
differentiation of the communities found. 

The rest of the paper is structured as follows. In section 2 we discuss the
data used on the FPs, continuing with the definition of networks from the
data in section 3. We present in section 4 a summary of methods for
identifying network communities, and apply the methods to the FP networks in
section 5. Finally, we discuss the consequences of our findings in section 6.



2 Data Preparation 

We draw on the latest version of the sysres EUPRO database. This database
includes all information publicly available through the CORDIS projects
database (http://cordis.europa.eu) and is maintained by ARC systems research
(ARC sys). The sysres EUPRO database presently comprises data on funded
research projects of the EU FPs (complete for FP1-FP5, and about 70% complete
for FP6) and all participating organizations. It contains systematic
information on project objectives and achievements, project costs, project
funding and contract type, as well as information on the participating
organizations including the full name, the full address and the type of the
organization.

For purposes of network analyses, the main challenge is the inconsistency 
of the raw data. Apart from incoherent spelling in up to four languages per 
country, organizations are labelled inhomogeneously. Entries may range from 
large corporate groupings, such as EADS, Siemens and Philips, or large public 
research organizations, like CNR, CNRS and CSIC, to individual departments 
and labs. 

Due to these shortcomings, the raw data is of limited use for meaningful
network analyses. Further, any fully automated standardization procedure is
infeasible. Instead, a labor-intensive, manual data-cleaning process is used
in building the database. Roediger-Schluga and Barber (2008) describe the
data-cleaning process in detail; here, we restrict discussion to the steps of
the process relevant to the present work. These are:

1. Identification of unique organization name. Organizational boundaries are
defined by legal control. Entries are assigned to appropriate organizations
using the most recent available organization name. Most records are
easily identified, but, especially for firms, organization names may have
changed frequently due to mergers, acquisitions, and divestitures.

2. Creation of subentities. This is the key step for mitigating the bias that 
arises from the different scales at which participants appear in the data 
set. Ideally, we use the actual group or organizational unit that partici- 
pates in each project, but this information is only available for a subset of 
records, particularly in the case of firms. Instead, subentities that operate 
in fairly coherent activity areas are pragmatically defined. Wherever pos- 
sible, subentities are identified at the second lowest hierarchical tier, with 
each subentity comprising one further hierarchical sub-layer. Thus, uni- 
versities are broken down into faculties/schools, consisting of departments; 
research organizations are broken down into institutes, activity areas, etc., 
consisting of departments, groups or laboratories; and conglomerate firms 
are broken down into divisions, subsidiaries, etc. Subentities can fre- 
quently be identified from the contact information even in the absence 
of information on the actual participating organizational unit. Note that 
subentities may still vary considerably in scale. 

3. Regionalization. The data set has been regionalized according to the
European Nomenclature of Territorial Units for Statistics (NUTS)
classification system, where possible to the NUTS3 level. Mostly, this has
been done via information on postal codes. (NUTS is a hierarchical system of
regions used by the statistical office of the European Community for the
production of regional statistics. At the top of the hierarchy are NUTS-0
regions (countries), below which are NUTS-1 regions and then NUTS-2 regions,
etc.)

Due to resource limitations, only organizations appearing more than thirty
times in the standardization table for FP1-FP5 have thus far been processed.
This could bias the results; however, the networks have a structure such that
the size of the bias is quite low (Roediger-Schluga and Barber 2008).



3 Network Definition 



Using the sysres EUPRO database, for each FP we construct a network
containing the collaborative projects and all organizational subentities that
are participants in those projects. An organization is linked to a project if
and only if the organization is a member of the project. Since an edge never
exists between two organizations or two projects, the network is bipartite.
The network edges are unweighted; in principle, the edges could be assigned
weights to reflect the strength of the participation, but the data needed to
assign the network weights is not available.
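
As an illustration of this construction, the following minimal sketch builds
a bipartite network from a list of (project, organization) membership
records. The record format, the function name build_bipartite, and the toy
names are assumptions made for the example; they are not taken from the
sysres EUPRO database.

```python
from collections import defaultdict

def build_bipartite(memberships):
    """Build an unweighted bipartite network from (project, organization) pairs.

    Returns two adjacency maps. Sets are used so that a repeated record
    does not create a second (weighted) edge.
    """
    orgs_of = defaultdict(set)      # project -> participating organizations
    projects_of = defaultdict(set)  # organization -> projects it takes part in
    for project, org in memberships:
        orgs_of[project].add(org)
        projects_of[org].add(project)
    return dict(orgs_of), dict(projects_of)

# Hypothetical toy records (illustrative only).
memberships = [
    ("P1", "UnivA/Physics"), ("P1", "FirmX/DivisionR"),
    ("P2", "UnivA/Physics"), ("P2", "InstituteB"),
    ("P2", "FirmX/DivisionR"),
    ("P1", "UnivA/Physics"),  # repeated record, collapses to one edge
]
orgs_of, projects_of = build_bipartite(memberships)
n_edges = sum(len(orgs) for orgs in orgs_of.values())
```

Every edge joins a project to an organization, so no edge can appear between
two projects or between two organizations.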

Previous investigations of the FPs often have made use of one-mode projection
networks (Almendral et al. 2007; Barber et al. 2006; Breschi and Cusmano
2004; Roediger-Schluga and Barber 2008), especially for the organizations.
While the projection networks can be useful, the construction of the
projections intrinsically loses information available in the bipartite
networks, which can lead to incorrect community structures (Guimera et al.
2007). In the present work, we thus focus exclusively on representations of
the Framework Programs as bipartite networks.
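
The loss of information under projection can be seen in a small example. The
sketch below, with hypothetical toy data and an illustrative function name,
shows two different bipartite networks that give the same unweighted
organization projection: one project with three partners, and three separate
two-partner projects.

```python
from itertools import combinations

def project_onto_orgs(orgs_of):
    """Unweighted one-mode projection: link two organizations sharing a project."""
    edges = set()
    for members in orgs_of.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges

# Two distinct bipartite networks (toy data, illustrative only).
one_big_project = {"P1": {"A", "B", "C"}}
three_small_projects = {"P1": {"A", "B"}, "P2": {"B", "C"}, "P3": {"A", "C"}}

projection_a = project_onto_orgs(one_big_project)
projection_b = project_onto_orgs(three_small_projects)
same_projection = projection_a == projection_b
```

Both inputs yield the same triangle among A, B, and C, so the projection
alone cannot distinguish a single joint project from three pairwise ones.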



4 Community Structure 



Of great current interest is the identification of community groups, or
modules, within networks. Stated informally, a community group is a portion
of the network whose members are more tightly linked to one another than to
other members of the network. A variety of approaches (Angelini et al. 2007;
Clauset et al. 2004; Girvan and Newman 2002; Gol'dshtein and Koganov 2006;
Hastings 2006; Newman and Girvan 2004; Newman and Leicht 2007; Palla et al.
2005; Reichardt and Bornholdt 2006) have been taken to explore this concept;
Danon et al. (2005) and Newman (2004b) provide useful reviews. Detecting
community groups allows quantitative investigation of relevant subnetworks.
Properties of the subnetworks may differ from the aggregate properties of the
network as a whole, e.g., modules in the World Wide Web are sets of topically
related web pages. Thus, identification of community groups within a network
is a first step towards understanding the heterogeneous substructures of the
network.

Methods for identifying community groups can be specialized to distinct
classes of networks, such as bipartite networks (Barber 2007; Guimera et al.
2007). This is immediately relevant for our study of the FP networks,
allowing us to examine the community structure in the bipartite networks.
Communities are expected to be formed of groups of organizations engaged in
R&D into similar topics, and the projects in which those organizations take
part.



Note (section 3): We work exclusively at the subentity level, and will refer
interchangeably to organizations and subentities.

4.1 Modularity



To identify communities, we take as our starting point the modularity,
introduced by Newman and Girvan (2004). Modularity makes intuitive notions of
community groups precise by comparing network edges to those of a null model.
The modularity Q is proportional to the difference between the number of
edges within communities c and those for a null model:

$$ Q = \frac{1}{2M} \sum_{c} \sum_{i,j \in c} \left( A_{ij} - P_{ij} \right) \qquad (1) $$



Along with eq. (1), it is necessary to provide a null model, defining
$P_{ij}$. The standard choice for the null model constrains the degree
distribution for the vertices to match the degree distribution in the actual
network. Random graph models of this sort are obtained (Chung and Lu 2002) by
putting an edge between vertices i and j at random, with the constraint that
on average the degree of any vertex i is $d_i$. This constrains the expected
adjacency matrix such that

$$ d_i = E\Big( \sum_{j} A_{ij} \Big) \qquad (2) $$

Denote $E(A_{ij})$ by $P_{ij}$ and assume further that $P_{ij}$ factorizes
into

$$ P_{ij} = p_i p_j , \qquad (3) $$

leading to

$$ P_{ij} = \frac{d_i d_j}{2M} \qquad (4) $$



A consequence of the null model choice is that $Q = 0$ when all vertices are
in the same community.
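
The modularity with the degree-preserving null model can be evaluated
directly from an edge list. The following is a minimal sketch, assuming a
simple undirected graph with each edge listed once and no self-loops; the
function name and the toy graph (two triangles joined by a single edge) are
illustrative only.

```python
def modularity(edges, community):
    """Q = (1/2M) * sum over same-community pairs (i, j) of (A_ij - d_i d_j / 2M)."""
    M = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Each within-community edge contributes A_ij + A_ji = 2 to the double sum.
    within = sum(2 for u, v in edges if community[u] == community[v])
    # Null-model term: sum over i, j in c of d_i d_j = (sum of degrees in c)^2.
    degree_sum = {}
    for v, d in degree.items():
        degree_sum[community[v]] = degree_sum.get(community[v], 0) + d
    expected = sum(s * s for s in degree_sum.values()) / (2 * M)
    return (within - expected) / (2 * M)

# Toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
together = {v: "a" for v in range(6)}

q_split = modularity(edges, split)        # 5/14 for this toy graph
q_together = modularity(edges, together)  # 0, as noted in the text
```

Placing each triangle in its own community gives a positive modularity,
while the single-community partition gives zero.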

The goal now is to find a division of the vertices into communities such that
the modularity Q is maximal. An exhaustive search for a decomposition is out
of the question: even for moderately large graphs there are far too many ways
to decompose them into communities. Fast approximate algorithms do exist
(see, for example, Newman 2004a; Pujol et al. 2006).



4.2 Finding Communities in Bipartite Networks 



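Specific classes of networks have additional constraints that can be
reflected in the null model. For bipartite graphs, the null model should be
modified to reproduce the characteristic form of bipartite adjacency
matrices:

$$ A = \begin{pmatrix} O & M \\ M^{T} & O \end{pmatrix} \qquad (5) $$

Here O denotes an all-zero block and M the block of project-organization
links, which is distinct from the M appearing in the prefactor of eq. (1).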



Recently, specialized modularity measures and search algorithms have been
proposed for finding communities in bipartite networks (Barber 2007; Guimera
et al. 2007). These measures and methods have not been studied as extensively
as the versions with the standard null model shown above, but many of the
algorithms can be adapted to the bipartite versions without difficulty.
Limitations of modularity-based methods (e.g., the resolution limit described
by Fortunato and Barthelemy 2007) are expected to hold as well.
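
To make the bipartite measure concrete, the sketch below evaluates a
bipartite modularity in which the null model places expected edges only
between projects and organizations, with expectation k_p d_o / m for a
project of degree k_p, an organization of degree d_o, and m edges in total.
This explicit form is not written out in the text above; it is assumed here
following the bipartite null model of Barber (2007), and the function name,
symbols, and toy data are illustrative.

```python
def bipartite_modularity(pairs, community):
    """Q = (1/m) * sum over same-community (project, org) pairs of (A_po - k_p d_o / m).

    `pairs` holds (project, organization) edges; `community` maps every
    vertex (projects and organizations share one label space) to a label.
    """
    edges = set(pairs)  # unweighted network: drop repeated records
    m = len(edges)
    k, d = {}, {}       # project degrees, organization degrees
    for p, o in edges:
        k[p] = k.get(p, 0) + 1
        d[o] = d.get(o, 0) + 1
    within = sum(1 for p, o in edges if community[p] == community[o])
    k_sum, d_sum = {}, {}  # summed degrees per community, by vertex type
    for p, deg in k.items():
        k_sum[community[p]] = k_sum.get(community[p], 0) + deg
    for o, deg in d.items():
        d_sum[community[o]] = d_sum.get(community[o], 0) + deg
    expected = sum(k_sum[c] * d_sum.get(c, 0) for c in k_sum) / m
    return (within - expected) / m

# Toy network: two projects with two partners each and no shared partners.
pairs = [("P1", "A"), ("P1", "B"), ("P2", "C"), ("P2", "D")]
two_groups = {"P1": 0, "A": 0, "B": 0, "P2": 1, "C": 1, "D": 1}
one_group = {v: 0 for v in two_groups}

q_two = bipartite_modularity(pairs, two_groups)
q_one = bipartite_modularity(pairs, one_group)
```

As with the standard null model, the single-community partition gives zero,
while separating the two disconnected project groups gives a positive value
(0.5 for this toy network).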
Community identification is then a search for high modularity partitions of
the vertices into disjoint sets. An exhaustive search for the globally
optimal solution is only feasible for the smallest networks, as the number of
possible partitions of the vertices grows far too rapidly with network size.
Several heuristics exist to find high-quality, if suboptimal, solutions in a
reasonable length of time. For the FP networks, we use a two-stage search
procedure:

1. Agglomerative hierarchical clustering, where small communities are
successively joined into larger ones such that the modularity increases. This
stage is based on the so-called fast modularity (FM) algorithm (Clauset et
al. 2004).

2. Greedy search, where vertices are moved amongst existing communities to
ensure the resulting partition is at a local optimum of modularity. This
stage uses the bipartite, recursively induced modules (BRIM) algorithm
(Barber 2007).

The coarse structure is found with FM, with incremental improvements provided
by BRIM.
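
The agglomerative idea of the first stage can be illustrated with a naive
sketch. This is neither the FM algorithm (which tracks modularity changes
with efficient data structures) nor BRIM; it simply tries every pairwise
merge and applies the one that raises the modularity most, which is only
practical for toy networks. The bipartite modularity used is the assumed
form with null model k_p d_o / m, and all names are illustrative.

```python
from itertools import combinations

def bipartite_modularity(edges, community):
    """Bipartite modularity with null model k_p * d_o / m (assumed form)."""
    m = len(edges)
    k_sum, d_sum = {}, {}  # summed project / organization degrees per community
    within = 0
    for p, o in edges:     # edges are distinct (project, organization) pairs
        k_sum[community[p]] = k_sum.get(community[p], 0) + 1
        d_sum[community[o]] = d_sum.get(community[o], 0) + 1
        within += community[p] == community[o]
    expected = sum(k * d_sum.get(c, 0) for c, k in k_sum.items()) / m
    return (within - expected) / m

def merged(community, keep, drop):
    """Return a copy of the partition with community `drop` absorbed into `keep`."""
    return {v: (keep if c == drop else c) for v, c in community.items()}

def greedy_agglomerate(edges):
    """Start from singletons; repeatedly apply the merge that increases Q the most."""
    vertices = sorted({v for edge in edges for v in edge})
    community = {v: idx for idx, v in enumerate(vertices)}
    best_q = bipartite_modularity(edges, community)
    while True:
        best_merge = None
        for a, b in combinations(sorted(set(community.values())), 2):
            q = bipartite_modularity(edges, merged(community, a, b))
            if q > best_q + 1e-12:  # accept strict improvements only
                best_q, best_merge = q, (a, b)
        if best_merge is None:      # no merge improves Q: stop
            return community, best_q
        community = merged(community, *best_merge)

edges = [("P1", "A"), ("P1", "B"), ("P2", "C"), ("P2", "D")]
community, q = greedy_agglomerate(edges)
```

On this toy network the procedure ends with the two groups {P1, A, B} and
{P2, C, D}. A vertex-moving refinement stage would then start from such a
partition.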

In principle, the above approach should be continued until a maximum in the
modularity is found. In practice, an excessively large number of communities
for visualization purposes can result, obscuring the core community
structure. To deal with this difficulty, communities are further merged, so
long as the modularity stays near the maximum (within ≈90%) and the general
structure of the communities is maintained (as determined with information
theoretic methods, see Danon et al. 2005).
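
One information theoretic measure discussed by Danon et al. (2005) is the
normalized mutual information between two partitions. The text above does
not state which measure was used, so the following sketch is an assumption
for illustration only; the function name and toy partitions are
hypothetical.

```python
from collections import Counter
from math import log

def normalized_mutual_information(part_a, part_b):
    """Normalized mutual information between two partitions of the same vertices.

    Equals 1 for identical partitions (up to relabeling) and 0 for
    statistically independent ones.
    """
    n = len(part_a)
    count_a = Counter(part_a.values())
    count_b = Counter(part_b.values())
    joint = Counter((part_a[v], part_b[v]) for v in part_a)
    numerator = -2.0 * sum(
        n_ab * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    denominator = sum(c * log(c / n) for c in count_a.values()) + sum(
        c * log(c / n) for c in count_b.values()
    )
    # Both partitions trivial (one community each): treat them as identical.
    return 1.0 if denominator == 0 else numerator / denominator

before = {"v1": 0, "v2": 0, "v3": 1, "v4": 1}
relabeled = {"v1": "x", "v2": "x", "v3": "y", "v4": "y"}
crossed = {"v1": 0, "v2": 1, "v3": 0, "v4": 1}

nmi_same = normalized_mutual_information(before, relabeled)
nmi_crossed = normalized_mutual_information(before, crossed)
```

A value close to one after a merge would indicate that the general structure
of the communities has been maintained.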



5 Communities in the Framework Program Networks

In fig. 1 we show a community structure for FP5, found as described above,
with a modularity of Q = 0.644 for 25 community groups. The communities are
shown as vertices in a network, with the vertex positions determined using
spectral methods (Seary and Richards 2003). The area of each vertex is
proportional to the number of edges from the original network within the
corresponding community. The width of each edge in the community network is
proportional to the number of edges in the original network connecting
community members from the two linked groups. The vertices and edges are
shaded to provide additional information about their topical structure, as
described in the next section. Each vertex is numbered, with numbers assigned
starting from 1 based on the size of the communities, with the largest
communities having the smallest numbers.
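
The quantities that set the vertex areas and edge widths of such a community
network can be obtained by counting edges. The sketch below is a minimal
illustration with hypothetical toy data and an illustrative function name;
it does not reproduce the spectral layout or the shading.

```python
from collections import Counter

def community_network(edges, community):
    """Count original edges inside each community and between community pairs."""
    inside = Counter()   # community -> number of internal edges (vertex area)
    between = Counter()  # (community, community) -> linking edges (edge width)
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            inside[cu] += 1
        else:
            between[tuple(sorted((cu, cv)))] += 1
    return inside, between

# Toy bipartite network with one edge linking the two communities.
edges = [("P1", "A"), ("P1", "B"), ("P2", "C"), ("P2", "D"), ("P1", "C")]
community = {"P1": 1, "A": 1, "B": 1, "P2": 2, "C": 2, "D": 2}
inside, between = community_network(edges, community)
```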

The networks from all of the FPs show definite community structure. In each
case, the modularity exceeds 0.6 (see table 1).

Figure 1: Community groups in the bipartite network of projects and
organizations for FP5.

FP   Number of Communities   Modularity
2    16                      0.641
3    14                      0.627
4    25                      0.662
5    25                      0.644
6    25                      0.632

Table 1: Communities in the FP networks. In each network, definite community
structure is observed.

5.1 Topical Profiles of Communities

Projects are assigned one or more standardized subject indices. There are 49
subject indices in total, ranging from Aerospace to Waste Management. We
denote by

$$ f(t) \geq 0 \qquad (6) $$

the frequency of occurrence of the subject index t in the network, with

$$ \sum_{t} f(t) = 1 . \qquad (7) $$

Similarly we consider the projects within one community c and the frequency

$$ f_c(t) \geq 0 \qquad (8) $$

of any subject index t appearing in the projects only of that community. We
call $f_c$ the topical profile of community c, to be compared with that of
the network as a whole.

Topical differentiation of communities can be measured by comparing their
profiles, among each other or with respect to the overall network. This can
be done in a variety of ways (Gibbs and Su 2002), such as by the Kullback
"distance"

$$ D_c = \sum_{t} f_c(t) \ln \frac{f_c(t)}{f(t)} . \qquad (9) $$

A true metric is given by

$$ d_c = \sum_{t} \left| f_c(t) - f(t) \right| , \qquad (10) $$

ranging from zero to two.
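
The profiles and both distances are straightforward to compute. The sketch
below assumes that frequencies are normalized over all subject-index
occurrences, so that each profile sums to one; the function names and the
toy assignment of subject indices to projects are hypothetical.

```python
from collections import Counter
from math import log

def profile(projects, subjects_of):
    """Relative frequency of each subject index over the given projects."""
    counts = Counter(t for p in projects for t in subjects_of[p])
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def kullback_distance(f_c, f):
    """Sum over t of f_c(t) * ln(f_c(t) / f(t)); terms with f_c(t) = 0 vanish."""
    return sum(p * log(p / f[t]) for t, p in f_c.items() if p > 0)

def l1_distance(f_c, f):
    """Sum over t of |f_c(t) - f(t)|, which lies between zero and two."""
    return sum(abs(f_c.get(t, 0.0) - f.get(t, 0.0)) for t in set(f_c) | set(f))

# Hypothetical toy assignment of subject indices to projects.
subjects_of = {
    "P1": ["Aerospace"],
    "P2": ["Aerospace", "Waste Management"],
    "P3": ["Waste Management"],
    "P4": ["Waste Management"],
}
f = profile(subjects_of, subjects_of)         # whole network
f_c = profile(["P1", "P2"], subjects_of)      # one community

d_c = l1_distance(f_c, f)                     # 8/15 for this toy data
D_c = kullback_distance(f_c, f)
```

Because the projects of a community are a subset of those in the network,
f(t) is positive wherever f_c(t) is, so the logarithm is always defined.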

Topical differentiation is illustrated in figs. 2(a) and 2(b). In the figure,
example profiles are shown, taken from the network in fig. 1. The
community-specific profiles correspond to the communities 1 and 2 in fig. 1.
Both communities are topically differentiated from the network as a whole,
and to a similar extent ($d_1 = 0.69$, $d_2 = 0.66$). However, the actual
topics are quite different, with community 1 dominated by subject indices
relating to biotechnology and the life sciences, while community 2 is
dominated by subject indices relating to manufacturing and transport.



6 Discussion 

We have presented an investigation of networks derived from the European 
Union's Framework Programs for Research and Technological Development. 
The networks are of substantial size, complexity, and economic importance. We 
have attempted to provide a coherent picture of the complete process, beginning 
with data preparation and network definition, then continuing with analysis of 
the network community structure. 

We first considered the challenges involved in dealing with a large amount of 
imperfect data, detailing the tradeoffs made to clean the raw data into a usable 
form under finite resource constraints. The processed data was used to define 
bipartite networks with vertices consisting of all the projects and organizational 
subentities involved in each FP. 

Next we analyzed the community structure of the Framework Programs. Using a
modularity measure and search algorithm adapted to bipartite networks, we
identified communities from the networks. We found that the communities are
topically differentiated based on the standardized subject indices for
Framework Program projects.

(b) Topical profile of community 2.

Figure 2: Community 1 shows strong topical differentiation ($d_1 = 0.69$)
from the network as a whole, being dominated by topics in biotechnology and
the life sciences. Community 2 also shows strong topical differentiation
($d_2 = 0.66$) from the network as a whole. Further, it is quite distinct
from community 1, being dominated by manufacturing- and transport-related
topics.

The communities identified will serve as a basis for further studies of the
Framework Programs. A natural extension of the present work is to examine
other properties by which the communities are differentiated. Properties of
organizations making up the communities can be explored, much as the subject
indices for the projects were examined in this work. Immediate candidates for
consideration include the types of the organizations (e.g., universities or
firms) and geographical location of the organizations (e.g., as countries or
using NUTS classifications). Further, the communities will be used as a basis
for modeling determinants of partner choice, such as with spatial interaction
models (Scherngell and Barber 2008a,b) or binary choice models (Paier and
Scherngell 2008), providing insight into the formation rules at work in
heterogeneous subsets of the Framework Programs.